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Abstract 

This  paper  derives  tradeoffs  between  three  basic  costs  of  a  parallel  algorithm:  synchronization,  data  movement,  and  computational 
cost.  Our  theoretical  model  counts  the  amount  of  work  and  data  movement  as  a  maximum  of  any  execution  path  during  the  parallel 
computation.  By  considering  this  metric,  rather  than  the  total  communication  volume  over  the  whole  machine,  we  obtain  new  insight 
into  the  characteristics  of  parallel  schedules  for  algorithms  with  non-trivial  dependency  structures.  The  tradeoffs  we  derive  are  lower 
bounds  on  the  execution  time  of  the  algorithm  which  are  independent  of  the  number  of  processors,  but  dependent  on  the  problem  size. 
Therefore,  these  tradeoffs  provide  lower  bounds  on  the  parallel  execution  time  of  any  algorithm  computed  by  a  system  composed  of  any 
number  of  homogeneous  components  each  with  associated  computational,  communication,  and  synchronization  payloads.  We  first  state 
our  results  for  general  graphs,  based  on  expansion  parameters,  then  we  apply  the  theorem  to  a  number  of  specific  algorithms  in  numerical 
linear  algebra,  namely  triangular  substitution,  Gaussian  elimination,  and  Krylov  subspace  methods.  Our  lower  bound  for  LU  factorization 
demonstrates  the  optimality  of  Tiskin’s  LU  algorithm  [24]  answering  an  open  question  posed  in  his  paper,  as  well  as  of  the  2.5D  LU  [20] 
algorithm  which  has  analogous  costs.  We  treat  the  computations  in  a  general  manner  by  noting  that  the  computations  share  a  similar 
dependency  hypergraph  structure  and  analyzing  the  communication  requirements  of  lattice  hypergraph  structures. 


1  Introduction 

We  model  a  parallel  machine  as  a  network  of  processors  which  communicate  by  point-to-point  messages.  This  model  has  three  basic 
architectural  parameters, 

•  a  -  network  latency  (time)  for  a  message  between  a  pair  of  processors, 

•  /?  -  time  to  inject  a  word  of  data  into  (or  extract  it  from)  the  network, 

•  7  -  time  to  perform  a  floating  point  operation  on  local  data, 
which  are  associated  with  three  algorithmic  costs, 

•  S  -  number  of  messages  sent  (network  latency  cost  /  synchronization  cost), 

•  LU  -  number  of  words  of  data  moved  (bandwidth  cost  /  communication  cost), 

•  F  -  number  of  local  floating  point  operations  performed  (computational  cost). 

We  describe  our  execution  schedule  model  and  show  how  S,  W,  and  F  are  measured  in  any  schedule  in  Section  2.  Each  quantity 
is  accumulated  along  some  path  of  dependent  executions  in  the  schedule.  The  sequence  of  executions  done  locally  by  any  processor 
corresponds  to  one  such  path  in  the  schedule,  so  our  costs  are  at  least  as  large  as  those  incurred  by  any  single  processor  during  the 
execution  of  the  schedule.  The  parallel  execution  time  of  the  schedule  is  closely  proportional  to  these  three  quantities,  namely, 

max(S'  ■  a,W  ■  P,  F  ■  y)  <  execution  time  <S'-q;-|-LU-/3-|-F-7. 

Since  our  analysis  will  be  asymptotic,  we  do  not  consider  overlap  between  communication  and  computation  and  are  able  to  measure  the 
three  quantities  separately.  Our  model  is  similar  to  the  LogP  model  [7]  with  L  =  a,  o  =  g  =  (3,  and  no  network  capacity  constraint. 

Our  theoretical  analysis  also  precludes  recomputation  within  a  parallelization  of  an  algorithm,  as  we  associate  an  algorithm  with  a 
set  of  vertices,  each  of  which  is  a  computation,  and  assign  them  to  unique  processors.  However,  a  parallel  algorithm  which  employs 
recomputation  has  a  different  dependency  graph  structure,  to  which  our  analysis  may  subsequently  be  applied.  While  there  are  many 
existing  parallel  algorithms  for  the  applications  we  explore  that  employ  recomputation,  to  the  best  of  our  knowledge  none  of  them 
perform  less  communication  than  ascribed  by  the  recomputation-excluding  lower  bounds  we  present. 

We  reason  about  parallel  algorithms  by  considering  the  dependency  graphs  of  certain  computations.  We  will  first  derive  theoretical 
machinery  for  obtaining  lower  bounds  on  dependency  graphs  with  certain  expansion  parameters  and  then  show  how  this  result  yields 
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lower  bounds  on  S,  W,  F  for  several  algorithms  in  numerical  linear  algebra  with  common  dependency  structures.  Most  of  our  lower 
bounds  apply  to  computations  which  have  n(n'^)  vertices,  with  a  d-dimensional  lattice  dependency  structure,  and  take  the  form 

F  ■  5'^-^  =  n  {n‘^)  ,  W  ■  =  n  (n‘^-1)  . 

These  bounds  indicate  that  a  growing  amount  of  local  computation,  communication,  and  synchronization  must  be  done  to  solve  a 
larger  global  problem.  Thus,  the  bounds  are  important  because  they  highlight  a  scalability  bottleneck  dependent  only  on  local  proces¬ 
sor/network  speed  and  independent  of  the  number  of  processors  involved  in  the  computation. 

In  particular,  we  show: 

•  For  solving  a  dense  n-by-n  triangular  system  by  substitution  (TRSV), 

^TRSV  •  -S'trsv  =  (n^)  , 

•  For  Gaussian  elimination  (GE)  of  a  dense  n-by-n  matrix, 

Fge  •  S'ge  =  ^  I  •  «5'ge  =  ^  (n^)  , 

•  For  computing  an  (s  +  1) -dimensional  Krylov  subspace  basis  with  a  (2m  -f  -point  stencil  (defined  in  Section  6.3), 

FKr  -3^,  =  ^  (m‘^  •  s‘^+1)  ,  WKr  '  ^  '  s^)  ■ 

The  lower  bounds  which  we  derive  in  this  paper  establish  the  communication  optimality  of  the  parallel  algorithms  for  LU  factorization 
given  by  Tiskin  [16]  and  Solomonik  and  Demmel  [20].  The  parallel  schedules  for  LU  and  QR  in  these  papers  are  parameterized  and 
exhibit  a  trade-off  between  synchronization  and  communication  bandwidth  cost.  Our  paper  answers  an  open  question  posed  by  Tiskin  at 
the  end  of  [24],  showing  that  it  is  not  possible  to  achieve  an  optimal  bandwidth  cost  for  LU  factorization  without  an  associated  increase 
in  synchronization  cost  (within  the  limits  of  our  assumptions). 

In  [20],  a  lower  bound  proof  was  given  which  demonstrated  this  trade-off  for  LU  factorization.  However,  the  proof  argument  in  [20] 
did  not  consider  the  possibility  of  overlap  between  the  communication  necessary  to  factorize  each  block,  and  therefore  was  incorrect. 
This  paper  extends  the  idea  of  this  tradeoff  to  a  more  general  theoretical  context,  and  presents  a  significantly  corrected  proof  in  this 
new  framework.  We  show  that  Gaussian  elimination  is  just  one  of  many  numerical  algorithms  whose  dependency  structure  necessitates 
the  tradeoff.  In  particular,  we  conjecture  that  our  results  extend  to  other  dense  matrix  factorizations  such  as  QR,  problems  in  dynamic 
programming,  and  certain  graph  algorithms. 


2  Previous  work 


Theoretical  lower  bounds  on  communication  volume  and  synchronization  can  be  parameterized  by  a  local/fast  memory  of  size  M. 
Most  previous  work  has  considered  the  total  sequential  or  parallel  communication  volume  Q,  which  corresponds  to  the  amount  of 
data  movement  across  the  network  (by  all  processors),  or  through  the  memory  hierarchy.  Hong  and  Kung  [14]  introduced  sequential 
communication  volume  lower  bounds  for  computations  including  n-by-n  matrix  multiplication,  Qmm  =  U(n^/-\/M),  the  n-point 
LET,  Qfft  =  log(n) /  log(M)),  and  the  d-dimensional  diamond  DAG  (a  Cartesian  product  of  line  graphs  of  length  n),  Qdmd  = 

D,(n‘^ Irony  et  al.  [13]  extended  this  approach  to  distributed-memory  matrix  multiplication  on  p  processors,  obtaining  the 
bound  TUmm  =  3l(n^ /p\/M).  Aggarwal  et  al.  [1]  proved  a  version  of  the  memory-independent  lower  bound  VFmm  =  U(n^/p^/^),  and 
Ballard  et  al.  [2]  explored  the  relationship  between  these  memory-dependent  and  memory-independent  lower  bounds.  Ballard  et  al.  [3] 
extended  the  results  for  matrix  multiplication  to  Gaussian  elimination  of  n-by-n  matrices  and  many  other  matrix  algorithms  with  similar 
structure,  finding 


IUge  =  ^ 


pyj  min(M,  rF  jp^/^) 


Bender  et  al.  [5]  extended  the  sequential  communication  lower  bounds  introduced  in  [14]  to  sparse  matrix  vector  multiplication. 
This  lower  bound  is  relevant  to  our  analysis  of  Krylov  subspace  methods,  which  essentially  perform  repeated  sparse  matrix  vector 
multiplications.  However,  [5]  used  a  sequential  memory  hierarchy  model  and  established  bounds  in  terms  of  memory  size  and  track 
(cacheline)  size,  while  we  focus  on  interprocessor  communication. 

Papadimitriou  and  Ullman  [17]  demonstrated  tradeoffs  for  the  2-dimensional  diamond  DAG  (a  slight  variant  of  that  considered 
in  [14]).  They  proved  that  the  amount  of  computational  work  Fdmd  along  some  execution  path  (in  their  terminology,  execution  time)  is 
related  to  the  communication  volume  Qdmd  and  synchronization  cost  S'dmd  as 

-^dmd  '  Qdmd  “  ^  )  Rud  Fdmd  *  ‘5'dmd  ~  ^  (tt  )  • 


These  tradeoffs  imply  that  in  order  to  decrease  the  amount  of  computation  done  along  the  critical  path  of  execution,  more  communication 
and  synchronization  must  be  performed.  Lor  instance,  if  an  algorithm  has  ‘execution  time’  cost  of  Fdmd  =  U(n&),  it  requires  S'dmd  = 
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Q.{n/h)  synchronizations  and  a  communication  volume  of  Qdmd  =  ^(ri? /h).  The  tradeoff  on  Tdmd  •  5'dmd  is  a  special  case  of  the 
d-dimensional  bubble  latency  lower  bound  tradeoff  we  derive  in  the  next  section,  with  d  =  2.  This  diamond  DAG  tradeoff  was  also 
demonstrated  by  Tiskin  [22]. 

Bampis  et  al.  [4]  considered  finding  the  optimal  schedule  (and  number  of  processors)  for  computing  d-dimensional  grid  graphs, 
similar  in  structure  to  those  we  consider  in  Section  6.  Their  work  was  motivated  by  [17]  and  took  into  account  dependency  graph 
structure  and  communication,  modeling  the  cost  of  sending  a  word  between  processors  as  equal  to  the  cost  of  a  computation. 

We  will  introduce  lower  bounds  that  relate  synchronization  to  computation  and  data  movement  along  dependency  paths.  Our  work 
is  most  similar  to  the  approach  in  [17];  however,  we  attain  bounds  on  W  (the  parallel  communication  volume  along  some  dependency 
path),  rather  than  Q  (the  total  communication  volume).  Our  theory  obtains  tradeoff  lower  bounds  for  a  more  general  set  of  dependency 
graphs  which  allows  us  to  develop  lower  bounds  for  a  wider  set  of  computations. 

3  Theoretical  model 

We  first  introduce  of  the  mathematical  notation  we  will  employ  throughout  this  and  later  sections 

•  Sets  are  defined  as  uppercase  letters  (S,  V). 

•  Vectors  are  defined  as  lowercase  boldface  letters  (v)  and  the  ith  element  of  v  is  indexed  as  Vi. 

•  Accordingly,  matrices  and  tensors  are  defined  as  uppercase  boldface  letters  (A,  T),  with  elements  Aij,Tijk. 

•  Functions  and  maps  will  be  defined  as  Greek  letters  (a{n)). 

•  Open/closed  intervals  in  M  are  denoted  (a,  b)  and  [a,  h]. 

The  dependency  graph  of  a  program  is  a  directed  acyclic  graph  (DAG)  G  =  (V,  E).  The  vertices  V  =  I  U  Z  U  O  correspond  to 
either  input  values  I  (the  vertices  with  in-degree  0),  or  the  results  of  (distinct)  operations,  in  which  case  they  are  either  temporary  (or 
intermediate)  values  Z,  or  outputs  O.  There  is  an  edge  {u,v)  G  E  C  V  x  {Z  U  O)  for  each  operand  u  of  each  operation  v.  These 
edges  represent  data  dependencies,  and  impose  limits  on  the  parallelism  available  within  the  computation.  For  instance,  if  G  =  {V,  E) 
is  a  line  graph  with  V  =  {wi, . . . ,  t;„}  and  E  =  {(vi,V2),  ■  ■  ■ ,  (vn-i,Vn)},  the  computation  is  entirely  sequential,  and  a  lower  bound 
on  the  execution  time  is  the  time  it  takes  a  single  processor  to  compute  E  =  n  —  1  operations.  Using  graph  expansion  and  hypergraph 
analysis,  we  will  derive  lower  bounds  for  computation  and  communication  for  dependency  graphs  with  certain  properties.  Then  we  will 
show  that  these  lower  bounds  can  be  applied  to  several  important  problems  in  numerical  linear  algebra.  These  lower  bounds  have  the 
form  of  tradeoffs  between  data  transfer  and  synchronization  cost  and  between  work  and  synchronization  cost,  and  form  a  conceptual 
communication  wall  which  limits  parallelism. 

A  parallelization  of  an  algorithm  corresponds  to  a  coloring  of  its  dependency  graph,  i.e.,  a  partition  of  the  vertices  into  p  disjoint 
sets  V  =  ljr=i  where  processor  i  for  i  G  {1, ...  ,p}  computes  Ci  D  {Z  U  O).  We  require  that  in  any  parallel  execution  among  p 
processors,  at  least  two  processors  compute  \]Z  U  0|/pj  elements;  this  assumption  is  necessary  to  avoid  the  case  of  a  single  processor 
computing  the  whole  problem  sequentially  (without  parallel  communication).  Any  vertex  v  of  color  i(v  G  Ci)  must  be  communicated 
to  a  different  processor  j  if  there  is  an  edge  from  v  to  a  vertex  in  Cj  (though  there  need  not  necessarily  be  a  message  going  directly 
between  processor  i  and  j,  as  the  data  can  move  through  intermediate  processors).  We  define  each  processor’s  communicated  set  as 

T,  =  {u:  {u,  w)  G  [(G,  X  (V  \  G,))  U  ((U  \  G,)  x  G,)]  n  E}  . 

We  note  that  each  Ti  is  a  vertex  separator  in  G  between  Ci  \  Ti  and  V  \  {Ci  U  T^).  For  each  processor  i,  a  communication  schedule, 
{nii,  {Rij},  {Fij},  {Mij}},  defines  a  sequence  of  rrii  time-steps.  The  time-steps  may  differ  from  processor  to  processor  and  relate  to 
each  other  only  via  explicit  communication  in  the  schedule.  At  time-step  j  G  {1, . . .  ,mi},  processor  i  receives  a  set  of  values  Rij  C  V 
(which  are  packed  in  one  message  and  originate  from  a  single  processor),  performs  a  computation  (or  no  computation)  Fij  C  V, 
\Fij\  <  1,  and  sends  a  set  of  values  Mij  C  V  (which  are  also  packed  in  one  message  and  have  a  single  destination  processor).  If  both 
Rij  and  Mij  are  empty,  no  communication  or  synchronization  is  done  by  processor  i  at  time-step  j.  Similarly,  if  Fij  =  0,  no  computation 
is  done  by  processor  i  at  time-step  j.  We  require  that  communication  is  point-to-point,  so  each  message  Mij  must  be  received  on  some 
unique  processor  k  at  some  time-step  I,  that  is  Rki  =  Mij.  The  schedule  must  perform  all  computations  Ci  =  IJ^i  Fij  and  the 
messages  sent  and  received  by  each  processor  i  must  include  their  communicated  set  R  C  IJJ^i  ^ij  U  Rij  (a  processor  may  send  and 
receive  values  not  in  its  communicated  set  in  order  to  avoid  synchronization  of  other  processors).  Further,  the  schedule  should  respect 
dependencies,  that  is,  for  each  /  G  Fij  and  each  {u,  f)  G  E,\f  u  G  Ci  then  u  G  Fik  for  some  k  G  {1, . . . ,  j  —  1},  and  if  t6  ^  Ci,  we 
require  that  u  G  Rik  for  some  k  G  {1, ...  ,j}.  Any  value  sent,  Vu  G  Mij,  must  either  be  an  input  {u  G  CiO  I)  or  must  have  been 
previously  received  or  computed  by  that  process,  which  means  there  exists  k  G  {1, . . . ,  j}  such  that  u  G  Fik  or  u  €  Rik. 

An  execution  of  a  parallel  algorithm  is  associated  with  a  communication  schedule  which  can  be  represented  by  the  weighted  DAG 
G  =  {V ,  E).  For  each  processor  i  G  {1,  •  ■  •  ,p}  and  time-step  j  G  mi],  we  include  vertex  {i,j)  G  V .  The  graph  includes 

0-weighted  edges  {{i,  k),  (z,  fc  -I-  1), 0)  G  E  for  k  G  {1, . . .  ,mi  —  1}  (corresponding  to  sequentially  executed  computations  Fik  and 
Fi^k+i)-  The  graph  also  includes  unit-weighted  edges,  {{i,j),  {k,  1),  1)  G  E,  for  each  nonempty  message  pair,  0  7^  Mij  =  Rki,  where 
i,k  G  {I,...  ,p),  i  ^  k,  j  G  {1, . .  .,mi},  I  G  {I, . .  .,mk}. 
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We  measure  the  execution  costs  by  considering  all  execution  paths  11  that  exist  in  the  communication  schedule  DAG  G:  each 
execution  path  tt  G  n  is  of  the  form  tt  =  ,  (z„,  j„)}  where  for  each  fc  G  {1, ...  ,n  -  1},  {{ik,jk),  {ik+i,jk+i),w)  G  E 

and  w  G  {0, 1}.  Since  any  two  computations  on  tt  happen  either  on  the  same  processor  or  follow  a  series  of  messages,  the  time  to 
execute  a  path  is  a  lower  bound  on  the  total  execution  time.  The  computation  cost  corresponds  to  the  longest  unweighted  path  through 
the  schedule,  i.e.. 


F 


=  max 

TT^n 


E 


(i,j)GTr 


Note  that  this  value  is  greater  than  or  equal  to  the  work  done  by  any  single  processor,  i.e.,  for  alH,  T"  >  | Fij  \  =  \Ci\.  The  bandwidth 

cost  of  the  communication  schedule  corresponds  to  the  maximum  vertex-weighted  path  through  the  schedule,  with  each  vertex  weighted 
by  the  size  of  the  messages  and  Rij,  i.e., 

W  =  max  ^  \Mij\  +  \Rij\. 


Similarly,  the  bandwidth  cost  is  at  least  as  large  as  that  incurred  by  any  single  processor,  i.e.,  for  all  i,  W  >  E,  \M,,\  +  |%|. 
Accordingly,  the  latency  cost  corresponds  to  the  longest  edge-weighted  path  through  the  schedule.  Defining  the  edge-to-weight  mapping 
v)  =  w  for  each  {u,  v,  w)  G  E,  the  maximum  weighted  path  in  the  schedule  is  given  by 


n— 1 

S=  max  Vw((4,jfe),(4+i,jfc+i)). 


Our  goal  will  be  to  lower  bound  the  bandwidth  and  latency  costs  for  any  parallelization  and  communication  schedule  of  a  given  depen¬ 
dency  graph. 


4  General  lower  bound  theorem  via  bubble  expansion 

In  this  section,  we  introduce  the  concept  of  dependency  bubbles  and  their  expansion.  Bubbles  represent  sets  of  interdependent  compu¬ 
tations,  and  their  expansion  allows  us  to  analyze  the  cost  of  computation  and  communication  for  any  parallelization  and  communication 
schedule.  We  will  show  that  if  a  dependency  graph  has  a  path  along  which  bubbles  expand  as  some  function  of  the  length  of  the  path, 
any  parallelization  of  this  dependency  graph  must  sacrifice  synchronization  or  incur  higher  computational  and  data  volume  costs,  which 
scale  with  the  total  size  and  cross-section  size  (minimum  cut  size)  of  the  bubbles,  respectively. 

4.1  Bubble  expansion 

Given  a  dependency  graph  G  =  {V,  E),  we  say  Vn  &  V  depends  on  wi  G  if  and  only  if  there  is  a  path  V  C  V  connecting  vi  to 
Vn,  i.e.,  V  =  {wi,  •  •  • ,  Vn}  such  that  {(ui,  V2),  ■ . . ,  u„)}  C  E.  We  denote  a  sequence  of  (not  necessarily  adjacent)  vertices 

{rui, . . . ,  Wn}  a  dependency  path,  if  for  z  G  {1, . . . ,  n  —  1},  zu^+i  is  dependent  on  Wi. 

The  (dependency)  bubble  around  a  dependency  path  V  connecting  vi  to  is  a  sub-DAG  /3(G,  V)  =  (Vg,  Ep)  where  Vp  C  V, 
each  vertex  u  G  Vp  lies  on  a  dependency  path  {ui, . . .  ,u, . . . ,  z;„}  in  G,  and  Ep  =  (Vp  x  Vp)  n  E.  This  bubble  corresponds  to  all 
vertices  which  must  be  computed  between  the  start  and  end  of  the  path.  Equivalently,  the  bubble  may  be  defined  as  the  union  of  all  paths 
between  vi  and  z;„. 

4.2  Lower  bounds  based  on  bubble  expansion 

A  i-balanced  vertex  separator  Q  c  V  of  a  graph  G  =  iV,E)  splits  V  \  Q  =  Vi  U  V2  so  that  min(|Vi|,  IV2I)  >  LI^I/^J 
E  C  (Vi  X  Vi)  U  (V2  X  V2)  U  {Q  X  V)  U  {V  x  Q).  We  denote  the  minimum  size  \Q\  of  a  ^-balanced  separator  Q  of  G  as  XqiG).  If 
/3(G,  V)  is  the  dependency  bubble  formed  around  path  V,  we  say  Xq{P{G,  V))  is  its  cross-section  expansion. 

Definition  4.1  We  call  a  directed-acyclic  graph  G  a  (e,  aj-path-expander  if  there  exists  a  dependency  path  V  in  the  graph  G 
and  a  positive  integer  constant  k  <C  \V\  such  that  every  subpath  7^  C  7^  of  length  \R\  >  k  has  bubble  /3{G,Tl)  =  {Vp,Ep)  with 
cross-section  expansion  Xqi/3{G,Tl))  —  n(e(|7^|))  for  any  real  constant  q  <C  |7^|,  q  >  2,  and  bubble  size  \Vp\  =  0(ct(|7^|)),  for 
some  real-valued  functions  e,  a  which  for  real  numbers  &  >  fc  are  positive  and  increasing  with  1  <  e(6  -I-  1)  —  e{h)  <  Cce{b)  and 
1  <  a{b  -f  1)  —  a{h)  <  Ca<j{b)  for  some  positive  real  constants  q,  ^  \V\. 

Theorem  4.1  (General  bubble  lower  bounds).  Suppose  a  dependency  graph  G  is  a  (e,  a) -path-expander  about  dependency  path  V. 
Then,  for  any  parallelization  of  G  in  which  no  processor  computes  more  than  half  of  the  vertices  of  j3{G,  V),  and  for  any  communication 
schedule,  there  exists  an  integer  b  G  [k,  \VW  such  that  the  computation  (F),  bandwidth  (W),  and  latency  (S)  costs  incurred  are 

F  =  n{a{b)-\V\/b),  W  =  n{eib)-\V\/b),  s  =  n{\v\/b). 
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Computation  chain 


Figure  1:  Illustration  of  the  construction  in  the  proof  of  Theorem  4.1  in  the  case  of  a  diamond  DAG  (e.g.,  [17]),  depicting  a  dependency 
path,  and  communication  and  computation  chains  about  that  path,  for  a  2-processor  parallelization. 


Proof.  We  consider  any  possible  parallelization,  which  implies  a  coloring  of  the  vertices  of  G  =  {V,E),  V  =  ljr=i  show 

that  any  communication  schedule  {mi,{  Rij  } ,  {  Fij  } ,  {  }  }  for  {1, . . .  ,p},  j  G  corresponding  to  the  communication 

schedule  graph  G  =  {V ,  E),  as  defined  in  Section  3,  incurs  the  desired  computation,  bandwidth,  and  latency  costs.  Our  proof  technique 
works  by  defining  a  chain  of  bubbles  within  G  in  a  way  that  allows  us  to  accumulate  the  costs  along  the  chain. 

The  tradeoff  between  work  and  synchronization,  F  =  il{a{b)  ■  \V\/b)  and  S  =  0(|7^|/6),  can  be  derived  by  considering  a  compu¬ 
tation  chain:  a  sequence  of  monochrome  bubbles  along  V,  each  corresponding  to  a  set  of  computations  performed  sequentially  by  some 
processor  (see  computation  chain  in  Figure  1).  However,  to  obtain  the  bandwidth  lower  bound,  we  must  instead  show  that  there  exists  a 
sequence  of  bubbles  in  which  some  processor  computes  a  constant  fraction  of  each  bubble;  we  then  sum  the  bandwidth  costs  incurred 
by  each  bubble  in  the  sequence.  We  show  a  communication  chain  (a  sequence  of  multicolored  bubbles)  for  a  diamond  DAG  in  Figure  1. 

Every  subpath  TZ  C  V  of  length  \7l\  >  k  induces  a  bubble  /3(G,  TZ)  =  (V/^,  Ep)  of  size  \Vp\  =  0((t(|7^|)).  The  ©-notation  implies 
that  there  exists  a  positive  finite  constant  d  <C  \V\  such  that  for  all  \TZ\  >  k, 

am)/d<\Vp\<d-am). 


We  also  fix  the  constant 

q  =  max  {d^{ccr  -f  1)  -f  1,  d  •  cr(fc)}  ; 

note  that  q  ^  \V\,  since  k,  d,  c^,  a(k)  <C  \'P\. 

We  define  the  bubbles  via  the  following  procedure,  which  partitions  the  path  V  into  subpaths  by  iteratively  removing  leading  sub¬ 
paths.  Assume  we  have  defined  subpaths  TZj  for  j  G  {1, ...  ,i  —  1}.  Let  the  tail  (remaining  trailing  subpath)  of  the  original  path 
be 

\  U  ^  {^1’  ■  •  ■  • 

i=i 

Our  procedure  defines  the  next  leading  subpath  of  length  r,  TZi  =  {G, . . .  ,tr},k  <  r  <  |7i|,  with  bubble  f3{G,TZi)  =  {Vi,Ei). 
Suppose  processor  I  computes  ti.  The  procedure  picks  the  shortest  leading  subpath  of  7)  of  length  r  >  k  which  satisfies  the  following 
two  conditions,  and  terminates  if  no  such  path  can  be  defined. 

Condition  1:  The  subset  of  the  bubble  that  processor  I  computes,  Ci  D  Vi,  is  of  size  at  least  |G;  n  Vi|  > 

Condition  2:  The  subset  of  the  bubble  that  processor  I  does  not  compute,  Vi\Ci,  is  also  of  size  at  least  |F]  \  G^j  >  [|Vi|/7j 

Let  c  be  the  number  of  subpaths  the  procedure  outputs  for  the  given  parallelization  of  G  and  T  —  V  \  Uj=i  =  {G  >  ■  •  •  >  i|r| }  be 
the  tail  remaining  at  the  end  of  the  procedure.  We  consider  the  two  cases,  |T|  =  i7(|G|)  and  \T^j  \  =  Isast  one  of  which 

must  hold.  We  show  that  in  either  case,  the  theorem  is  true  for  some  value  of  b. 

If  |T|  =  i7(|7^|)  (the  tail  is  long),  we  show  that  Condition  1  must  be  satisfied  for  any  leading  subpath  of  T.  Our  proof  works  by 
induction  on  the  length  of  the  leading  subpath  k  <  r  <  |T|,  with  the  subpath  given  hy  ICr  =  {ti, . . . ,  G}-  We  define  the  bubble  about 
ICr  as  /3(G,  ICr)  =  {Vr,  Er) ■  When  r  =  k,  Condition  1  is  satisfied  because  \  Vr\/q  <  d  ■  a{k)/q  <  1  and  processor  I  computes  at  least 
one  element,  G- 

Lor  r  >  k,  we  have 

\Vr\  <  d  ■  a{r)  <  d{ca  +  1)  •  <j{r  —  1)  <  —  •  cr{r  —  1). 

Lurther,  by  induction.  Condition  1  was  satisfied  for  ICr-i  which  implies  Condition  2  was  not  satisfied  for  ICr-i  (otherwise  the  procedure 
would  have  terminated  with  a  path  of  length  r  —  1).  Now,  using  bounds  on  bubble  growth  we  that  since  Condition  2  was  not  satisfied  for 
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Kr-i,  Condition  1  has  to  be  satisfied  for  the  subsequent  bubble,  Kr, 


\CinvA  >  IQnK-il  >  iK-il 
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>  —\Vr-l\  >  -  1)  >  VrI  > 

q  dq  q 
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so  Condition  1  holds  for  ICr  for  r  €  {k, . . . ,  |T|}.  Further,  since  the  tail  is  long,  |T|  =  n(|7^|),  due  to  Condition  1,  processor  I  must 
compute 

F>  Ll^/3(G,r)l/9j  =niai\m 

vertices.  Since,  by  assumption,  no  processor  can  compute  more  than  half  of  the  vertices  of  /3(G,  V),  we  claim  there  exists  a  subpath  Q 
ofV,T  C  Q  C  V,  where  processor  I  computes  [|VJ3((5  g)  j/gj  vertices  and  does  not  compute  [|V53((5  g)  j/gj  vertices.  The  path  Q  may 
always  be  found  to  satisfy  these  two  conditions  simultaneously,  since  we  can  grow  Q  backward  from  T  until  Condition  2  is  satisfied, 
i.e.,  processor  I  does  not  compute  at  least  LI^/3(G,C) I/9J  vertices,  and  we  will  not  violate  the  first  condition  that  (\Ci  n  C^(g.C)I  ^ 
LI^/3(G,C)  I/9J  ^  which  holds  for  T,  due  to  bounds  on  growth  of  |  V^(G,g)  |.  The  proof  of  this  assertion  is  the  same  as  the  inductive  proof 
above  which  showed  that  Condition  1  holds  on  ICr-  So,  processor  I  must  incur  a  communication  cost  proportional  to  a  ^-balanced  vertex 
separator  of  /3{G,  Q)  with  size  W  =  n(e(|  Q|))  =  fl(e(|T|)).  Since  these  costs  are  incurred  along  a  path  in  the  schedule  consisting  of  the 
work  and  communication  done  only  by  processor  /,  the  bounds  hold  for  b  =  |T|  (note  that  n(|7^|/|T|)  =  ^^(1),  because  |T|  =  n(|7^|)). 

In  the  second  case,  \F-j\  =  the  procedure  generates  subpaths  with  a  total  size  proportional  to  the  size  of  V.  For  each 

i  G  {1, . . . ,  c},  consider  the  time-step  m  during  which  processor  I  computed  the  first  vertex  ti  on  the  path  TZi,  that  is,  {I,  m)  G  V  where 
I  G  {1,  ■  ■  ■  ,p}  and  m  G  {1 . . .  mi},  such  that  Fim  =  {fi}  (recall  that  each  process  computes  at  most  one  vertex  every  time-step).  We 
choose  the  smallest  s  >  m  such  that  C;  n  C  Um'=m  Fim'  (processor  I  computes  its  part  of  Vi  between  time-steps  m  and  s).  Now 
consider  the  time-step  v  on  processor  u,  {u,  v)  G  V,  during  which  the  last  vertex  G  on  the  path  7Zj  was  computed,  that  is,  (u,  v)  G  V 
where  u  G  {1, . . .  ,p}  and  v  G  {1, . . .  ,m„},  such  that  F^v  =  {G}-  Note  that  for  some  z  G  V,  Fis  =  {z},  (otherwise,  s  can  be 

taken  to  be  to  s  —  1,  and  so  is  not  the  smallest)  and  tr  is  dependent  on  z  (G  is  dependent  on  all  vertices  in  the  bubble).  So,  there 

will  be  an  execution  path  tt^  =  {(/,  m), s), ...,  (u,  u)}  C  F  in  the  communication  schedule.  This  path  has  outgoing  messages 
{Mim,  ■  •  ■ ,  Mis,  ■  ■  ■ ,  Muv}  and  incoming  messages  {Rim,  ■  ■  ■ ,  Ris,  ■  ■  ■ ,  Ruv}  that  include  all  vertices  in  V)  which  processor  I  must 
communicate  to  compute  Vi  OCi,  which  is  given  by  the  set 

fu  =  [u  :  {u,  w)  G  [{Cl  X  {V  \  Cl))  U  iiV  \  Cl)  x  Ci)]  n  E,}  , 

which  is  a  separator  of  /3{C,  TZi)  and  is  ^-balanced  due  to  Conditions  1  and  2  in  the  definition  of  TZi.  We  use  the  lower  bound  on  the 

minimum  separator  of  a  bubble  to  obtain  a  lower  bound  on  the  size  of  the  communicated  set  for  processor  I  in  the  ith  bubble, 

\fu\>x,{P{G,TZ,))  =  n{e{\TZ,\)), 


where  we  are  able  to  bound  the  growth  of  P{C,  TZi),  since  \TZi\  >  k.  There  exists  a  dependency  path  between  the  last  element  of  TZi  and 
the  first  of  TZi+i  since  they  are  subpaths  of  T’,  so  every  bubble  /3{C,  TZi)  must  be  computed  entirely  before  any  members  of  /3{C,  TZi+i) 
are  computed.  Therefore,  there  is  an  execution  path  TTcriticai  C  C  in  the  communication  schedule  which  contains  Wi  C  TTcriticai  as  a  subpath 
for  every  z  G  {1, . . . ,  c}.  Communication  and  computation  along  TTcriticai  can  be  bounded  below  by 


F=J2  l^’Gl>EJl/3(G,7^r)| 

(*  j')6'n’critical 


\i=l  / 


w=Y.mj\  +  \R^,\> 

(tjf  )  £ '^'critical 

Further,  since  each  bubble  contains  vertices  computed  by  multiple  processes,  between  the  first  and  last  vertex  on  the  subpath  forming 
each  bubble,  a  network  latency  cost  of  at  least  one  must  be  incurred  per  bubble,  therefore, 

S>=il{c). 

Because  a{b+l)  —  a{b)  >  1  for  b>  k  >  \TZi\,  and  the  sum  of  all  the  lengths  of  the  subpaths  is  bounded  (J2i  \Ri\  <  \F\),  the  above 
lower  bounds  for  F  and  W  are  minimized  when  all  TZi  of  the  same  length^.  Picking  this  length  as  b,  that  is  |7^i|  =  &  =  0(|7^|/c)  for 
all  i,  leads  to  a  simplified  form  for  the  bounds, 

F  =  rt{a{b)-\V\/b),  W  =  n{eib)-\V\/b),  s  =  n{\v\/b). 


J2xMG,n,))=n  T.  <m) 


□ 

^This  mathematical  relation  can  be  demonstrated  by  a  basic  application  of  Lagrange  multipliers. 
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Corollary  4.2  (d-dimensional  bubble  lower  bounds).  Let  V  be  a  dependency  path  in  G,  such  that  every  subpath  TZ  C  V  of  length 
\R\  >  k,  where  k  <C  \V\,  has  bubble  P{G,Tl)  =  {Vp,Ejf}  with  cross-section  expansion  Xq{l3{G,Tl))  =  17(17^1^^“^)  for  any  constant 
2  <  <7  <C  \V\  and  bubble  size  \Vp\  =  for  an  integer  2  <  d  k.  The  computation,  bandwidth,  and  latency  costs  incurred 

by  any  parallelization  of  G  in  which  no  processor  computes  more  than  half  of  the  vertices  of  j3{G,V),  and  with  any  communication 
schedule,  must  obey  the  relations 

F  ■  =  n  {\vf) ,  w  ■  =  n  . 

Proof  This  is  an  application  of  Theorem  4.1  with  e(6)  =  b’^~^  and  a{b)  =  b‘^.  The  theorem  yields 

F  =  n  ■  \v\) ,  w  =  n  {b‘^-‘^  ■  \r\) ,  s  =  n{\r\/b). 

These  equations  can  be  manipulated  algebraically  to  obtain 

F  ■  5'^-^  =  n  {\rf) ,  w  ■  =  n  . 


□ 


5  Lower  bounds  on  lattice  hypergraph  cuts 


For  any  hypergraph  H  =  (V,  E),  we  say  a  hyperedge  e  G  E  is  internal  to  some  C'  C  C  if  e  C  V .  If  no  e  €  is  adjacent  to  (i.e., 
contains)  a  u  €  C'  C  C,  then  say  V  is  disconnected  from  El.  A  ^ -balanced  (hyperedge)  cut  of  is  a  subset  of  E  whose  removal  from 
H  partitions  C  =  Vi  U  V2  with  min(|  Vi  |,  IV2I)  >  ^\V\  such  that  all  remaining  (uncut)  hyperedges  are  internal  to  one  of  the  two  parts. 

We  define  a  d-dimensional  lattice  hypergraph  El  =  {V,E)  of  breadth  n,  with  \V\  =  (^)  vertices  and  \E\  =  hyperedges. 

Each  vertex  is  represented  as  =  fi, . . .  ,if)  for  {ii, . . . ,  ij}  €  {1, . . . ,  n}"^  with  ii  <  ■  ■  ■  <  Each  hyperedge  connects  all 

vertices  which  share  d  —  1  indices,  that  is  for  {ji, . . .  ,jd-i}  €  {1, . . . ,  with  ji  <  ■  ■  ■  <  jd-i  includes  all  vertices 

for  which  {ji, . . .  ,jd-i}  C  {ii, . . . ,  id}-  There  are  n—{d—l)  vertices  per  hyperedge,  and  each  vertex  appears  in  d  hyperedges. 
Each  hyperedge  intersects  {d  —  l)(n  —  {d  —  1))  other  hyperedges,  each  at  a  unique  vertex. 

A  key  step  in  the  lower  bound  proofs  in  [13]  and  [3]  was  the  use  of  an  inequality  introduced  by  Loomis  and  Whitney  [15].  We  will 
use  this  inequality  (in  the  following  form)  to  prove  a  lower  bound  on  the  cut  size  of  a  lattice  hypergraph. 


Theorem  5.1  (Loomis- Whitney).  Let  V  be  a  set  of  d- tuples  {ii,. . .  ,id)  G  N'^,  and  consider  projections  ttj  :  — >■  N'^  ^  for  j  G 

{1, . . . ,  c?}  defined  as 

. . . ,  id)  =  (?i)  •  •  ■ )  ij—ii  *j+i)  ■  •  ■ )  *tz)) 


then  the  cardinality  ofV  is  bounded  by. 


i=i 


Theorem  5.2.  For  2  <  d,q  ^  n,  the  minimum  --balanced  cut  of  a  d-dimensional  lattice  hypergraph  H  =  {V,  E)  is  of  size  Cq{H)  = 


Proof  We  prove  Theorem  5.2  by  induction  on  the  dimension,  d.  In  the  base  case  d  =  2,  we  must  show  that  eq{El)  =  El{n/ 
Consider  any  ^-balanced  cut  Q  C  E,  which  splits  the  vertices  into  two  disjoint  sets  V)  and  V2.  Note  that  in  2  dimensions,  every  pair  of 
hyperedges  overlaps,  i.e.,  for  ii,  12  G  {1, . . . ,  n}  with  ii  <  12,  n  =  {vii.ial-  If  ths  first  partition,  Vi,  has  an  internal  hyperedge, 
then  since  every  pair  of  hyperedges  overlaps,  Vi  is  adjacent  to  all  hyperedges  in  H.  Therefore,  every  hyperedge  adjacent  to  the  other  part 
V2  C  C  must  be  in  the  cut,  Q.  On  the  other  hand,  if  Vi  has  no  internal  hyperedges,  then  all  hyperedges  adjacent  to  Vi  must  connect  both 
parts,  and  thus  are  cut.  So,  without  loss  of  generality  we  will  assume  that  V2  is  disconnected  after  the  cut.  We  now  argue  that  Lt{n/y/q) 
hyperedges  must  be  cut  to  disconnect  V2- 

Since  the  cut  is  ^-balanced,  we  know  that  IV2I  >  n{n  —  l)/{2q).  To  disconnect  each  G  V2,  both  adjacent  hyperedges  (cij  and 
6^2)  must  be  cut.  We  can  bound  from  below  the  cut  size  by  first  obtaining  a  lower  bound  on  the  product  of  the  sizes  of  the  projections 
7ri(vji  12)  =  *2  and  =  *1  via  the  Loomis-Whitney  inequality  (Theorem  5.1), 


|7ri(y2)|  •  k2(V2)|  >  |V"2|  >  n(n  -  l)/(2g), 


and  then  concluding 

eq{H)  =  |7ri(V2)  U  7r2(t^2)| 

^(kl(f"2)|  +  k2(F2)|) 

>  \/ki(V2)|  •  k2(V2)|  >  \/n(n-  l)/{2.q) 
=  n  {n/y/q)  , 
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since  the  size  of  the  union  of  the  two  projections  equals  the  number  of  hyperedges  that  must  be  cut  to  disconnect  V2. 

For  the  inductive  step,  we  assume  that  the  theorem  holds  for  dimension  d  —  1  and  prove  that  it  must  also  hold  for  dimension  d, 
where  c?  >  3.  In  d  dimensions,  we  define  a  hyperplane  Xki,...,kd-2  for  ®^oh  {ki, . . . ,  kd-2}  G  {!>  •  •  ■  >  with  ki  <  •  •  ■  <  kd-2 

as  the  set  of  all  hyperedges  which  satisfy  {fci, . . . ,  kd-2}  C  {ji, . . . ,  jd-i}-  Thus,  each  of  the  |X|  =  (^"2)  hyperplanes 

contains  n  —  (d  —  2)  hyperedges,  and  each  hyperedge  is  in  d  —  1  hyperplanes.  Note  that  each  hyperplane  shares  a  unique  hyperedge  with 
(d  —  2)  (n  —  (d  —  2))  other  hyperplanes.  Further,  each  hyperedge  in  a  hyperplane  intersects  each  other  hyperedge  in  the  same  hyperplane 
in  a  unique  vertex,  and  the  set  of  all  these  vertices  are  precisely  those  sharing  the  d  —  2  indices  defining  the  hyperplane. 

Consider  any  f -balanced  hypergraph  edge  cut  Q  C  E.  Since  all  hyperedges  which  contain  vertices  in  both  Vi  and  V2  must  be  part 
of  the  cut  Q,  all  vertices  are  either  disconnected  completely  by  the  cut  or  remain  in  hyperedges  which  are  all  internal  to  either  Vi  or  V2. 
Let  Ui  C  Vi  be  the  vertices  contained  in  a  hyperedge  internal  to  Vi  and  let  1/2  C  V2  be  the  vertices  contained  in  a  hyperedge  internal  to 
V2-  Since  both  Vi  and  1^2  contain  [n‘^/q\  vertices,  either  [n‘^/2q\  vertices  must  be  in  internal  hyperedges  within  both  Vi  as  well  as  V2, 
that  is, 

case  (i):  |17i|  >  [n‘^/2q\  and  IC/2I  >  [n‘^/2q\, 
or  there  must  be  [n‘^/2q\  vertices  that  are  disconnected  completely  by  the  cut, 

case  (ii):  |(Vi  \  Ui)  U  (V2  \  U2)\  >  [n‘^/2q\. 

In  case  (i),  since  both  Ui  and  U2  have  at  least  [n‘^/2q\  vertices,  we  know  that  there  are  at  least  \Ui\/{n  —  (d  —  1))  >  ^ /‘^q\ 

hypergraph  edges  Wi  which  are  internal  to  Vi  after  the  cut  Q,  and  a  similar  set  of  hyperedges  W2  internal  to  V2.  We  now  obtain  a  lower 
bound  on  the  size  of  the  cut  Q  for  this  case  by  counting  the  hyperplanes  which  Q  must  contain.  Our  argument  relies  on  the  idea  that  if 
two  hypergraph  edges  are  in  the  same  hyperplane,  the  entire  hyperplane  must  be  disconnected  (all  of  its  hyperedges  must  be  part  of  the 
cut  Q)  in  order  to  disconnect  the  two  edges.  This  allows  us  to  bound  the  number  of  hyperplanes  which  must  be  disconnected  in  order 
for  Wi  to  be  disconnected  from  IL^. 

We  define  a  new  (d  —  1) -dimensional  lattice  hypergraph  H'  =  {E,  X),  with  vertices  and  hyperedges  equal  to  the  hyperedges  and 
hyperplanes  of  the  original  hypergraph  El.  The  cut  Q  induces  a  ^-balanced  cut  on  id'  since  it  creates  two  disconnected  partitions  of 
hyperedges:  Wi  and  IL2,  each  of  size  /2q\ .  We  can  assert  a  lower  bound  on  the  size  of  any  ^-balanced  cut  of  El'  by  induction, 

eg(id')  =  n  ^  ■ 

This  lower  bound  on  cut  size  of  H'  yields  a  lower  bound  on  the  number  of  hyperplanes  which  must  be  cut  to  disconnect  the  hyperedges 
into  two  balanced  sets  Wi  and  IL^.  Remembering  that  disconnecting  each  hyperplane  requires  cutting  all  of  its  internal  n  —  (d  —  2) 
hyperedges  (and  also  that  each  pair  of  hyperplanes  overlap  on  at  most  one  hyperedge),  allows  us  to  conclude  that  the  number  of 
hyperedges  cut  (in  Q)  must  be  at  least 

e,(dd)  >  =  F!  (n''" • 

The  quantity  on  the  right  is  always  larger  than  the  lower  bound  we  are  trying  to  prove,  eq{H)  =  so  the  proof  for  this 

case  is  complete. 

In  case  (ii),  we  know  that  [n‘^/2q\  vertices  U  C  V  are  disconnected  by  the  cut  (before  the  cut,  every  vertex  was  adjacent  to  d 
hyperedges).  We  define  d  projections, 

■  ■  ■  iL'— 

for  j  G  {1, . . . ,  d}  corresponding  to  each  of  d  hyperedges  adjacent  to  We  apply  the  Loomis-Whitney  inequality  (Theorem  5.1) 

to  obtain  a  lower  bound  on  the  product  of  the  size  of  the  projections, 

d 

l[\n,{U)\^/^'^-^^>\U\>[n^/2q\, 

i=i 

and  then  conclude  with  a  lower  bound  on  the  number  of  hyperedges  in  the  cut  of  H, 

eq{H)  >  I  U  7r,(i7)|  ^  ^  n 

i=i  i=i 

where  we  discard  the  constant  ^  by  applying  our  assumption  that  d  ^  n.  By  induction,  this  lower  bound  holds  for  all  2  <  d  <C  n.  □ 


6  Applications 


In  this  section,  we  apply  the  general  theorems  derived  in  the  previous  sections  to  obtain  lower  bounds  on  the  costs  associated  with  a  few 
specific  numerical  linear  algebra  algorithms.  We  treat  the  dependency  graphs  of  the  algorithms  in  a  general  manner  by  reducing  them  to 
lattice  hypergraphs. 

Consider  a  dependency  graph  G  =  (V,  E)  and  a  partition  {Ei}  of  its  edge  set  E.  Let  Vi  be  all  the  vertices  adjacent  to  the  edge 
partition  Ei  for  each  i  (while  {Ei}  is  a  disjoint  partition  of  E,  {Vi}  is  not  necessarily  a  disjoint  partition  of  V).  We  can  define  a 
hypergraph  E[  =  {V,  D)  based  on  G,  where  in  each  hyperedge  di  G  D,  every  pair  of  vertices  u,v  G  di  is  connected  in  G  via  a  path 
consisting  of  edges  in  Ei.  By  defining  El  in  this  manner,  any  cut  of  G  G  E  corresponds  to  a  hypergraph  cut  of  El  with  at  most  \G\ 
hyperedges,  which  may  be  obtained  by  cutting  all  hyperedges  di  G  D  corresponding  to  parts  Ei  which  contain  cut  edges,  i.e.,  3e  G  Ei 
such  that  e  G  C. 

We  will  also  employ  hyperedges  to  obtain  lower  bounds  on  sets  of  vertices  connected  via  arbitrary  reduction  (sum)  trees.  Any 
reduction  tree  T  =  (R,  E)  which  sums  a  set  of  vertices  S  C  R  must  connect  each  pair  of  vertices  in  S.  Therefore,  we  can  define  a 
hypergraph  edge  corresponding  to  this  reduction  tree,  which  contains  the  edges  in  S  (ignoring  the  intermediate  vertices  R  \  S  which 
depend  on  the  particular  tree),  with  the  edge  partition  corresponding  to  the  hyperedge  being  E  for  any  possible  reduction  tree  T  = 

{R,E). 

6.1  Triangular  solve 

First,  we  consider  a  parameterized  family  of  dependency  graphs  G'trsv(^)  associated  with  an  algorithm  for  the  triangular  solve  (TRSV) 
operation.  In  TRSV,  we  are  interested  in  computing  a  vector  x  of  length  n,  given  a  dense  nonsingular  lower-triangular  matrix  L  and  a 
vector  y,  satisfying 

L  •  X  =  y, 

i.e.,  J2j=i  ^ij  '  —  Vi’  ^  {!)  ■  •  ■  )tr}.  A  sequential  TRSV  implementation  is  given  in  Algorithm  1.  For  convenience,  we 


Algorithm  1  Triangular  solve  (TRSV)  algorithm 
X  =  TRSV(L,y,n) 

1  for  z  =  1  to  n 

2  for  j  =  1  to  z  —  1 

3 

4  Xi  =  (yi  -  I  La 


introduced  the  intermediate  matrix  Z  (which  need  not  be  formed  explicitly  in  practice),  and  corresponding  intermediate  ‘update’  vertices 
{Zij  :  i,j  G  {1, . . . ,  n},j  <  i}.  We  see  that  the  computation  of  Zij  for  z  =  {2, . . . ,  rz}  and  some  J  <  z  depends  on  the  computation  of 
Xj,  which  in  turn  influences  the  computations  of  Zjk  for  all  k  <  j. 

6.1.1  Lower  bounds 

For  fixed  rz,  alternative  orders  exist  for  the  summation  on  line  4,  leading  to  multiple  dependency  graphs  G'TRSv(ti)-  However,  any 
order  of  this  summation  must  eventually  combine  all  partial  sums;  therefore,  the  vertices  corresponding  to  the  computation  of  each 
Xi,  i.e.,  Zij  for  all  j  G  {1, . . . ,  z  —  1},  must  be  connected  via  some  reduction  tree.  We  will  define  a  2-dimensional  lattice  hypergraph 
^^TRSV  =  (4trsV)  1^trsv)>  which  will  allow  us  to  obtain  a  communication  lower  bound  for  all  possible  orderings  of  this  computation 
(i.e.,  all  possible  Gtrsv(’t^)).  in  which  we  will  omit  the  output  and  input  vertices  (x  and  y). 


^TRSV  =  {Zij  'i,j  G  {1, . . .  ,z  -  1},Z  >  j}  ) 

Gtrsv  ={g  :  z  e  {1, . . . ,  zz}}  where 

ei  =  {Zij:j  e  {1,. .  .,i-l}}\J{Zki-.kG{i+l,.  ■  .,zz}}  . 

The  hyperedges  Strsv  can  be  enumerated  with  respect  to  the  vector  x  or  y;  the  zth  hyperedge  Ci  G  F^trsv  includes  all  intermediate 
values  which  are  dependencies  of  Xi  (Zij  for  j  G  {1, . . . ,  z  —  1})  or  dependent  on  Xi  (Z^i  for  k  G  {i  +  1, . . . ,  rz}).  This  hypergraph  is 
depicted  in  Figure  2. 

Lemma  6.1.  Any  vertex  separator  of  any  dependency  graph  Gtrsv(zz)  which  subdivides  the  rz(rz  —  1) /2  intermediate  vertices  Z  into 
two  disjoint  sets  of  size  {rf  /2q\  where  2  <  g  <C  zz,  must  have  size  at  least 

X9(Gtrsv(zz))  =fl{n/i/q). 

Proof  Consider  any  vertex  separator  S  on  Gtsrv(zz)  which  subdivides  Z  into  two  sets  of  size  [zz^/2gj.  Any  graph  Gtrsv(zz)  may 
contain  vertices  corresponding  to  x,  Z  or  other  intermediate  vertices  which  are  intermediate  nodes  in  a  reduction  tree,  whose  sum 
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Figure  2:  Depiction  of  the  hypergraph  iFTRSV  along  with  the  inputs  and  outputs;  each  line  of  a  different  color  corresponds  to  a  hyperedge. 


contributes  to  Xi  for  some  i  (including  vertices  from  y  does  not  affect  the  connectivity  of  vertices  in  Z).  We  now  show  that  for 
any  such  separator  S  there  exists  a  hypergraph  edge  cut  C  on  iFxRSV  that  is  at  most  twice  the  size.  The  inclusion  of  any  vertex 
Zij  G  S,  disconnects  this  vertex  from  the  rest  of  the  graph  and  does  not  disconnect  any  other  path  between  two  vertices  in  Z,  since 
such  dependency  paths  all  go  through  the  reduction  tree  and  x.  For  each  such  vertex  {Zij  G  S)  we  add  edges  and  ej  from  i?TRSV 
into  the  cut  C,  which  completely  disconnects  Zij  from  other  vertices  in  Z  in  iFxRSV-  The  inclusion  of  a  separator  vertex  G  S, 
disconnects  all  vertices  which  are  dependencies  of  Xi  (Zij  for  j  G  {1, . . .  ,i  —  1})  from  all  vertices  which  are  dependent  on  Xi  (Z^i 
for  k  G  {i  +  1, . . . ,  n}).  In  this  case,  we  add  edge  Ci  to  C,  which  has  the  same  effect  on  iFxRSV  of  disconnecting  all  vertices  in  Z 
dependent  on  Xi  from  all  of  vertices  which  are  dependencies  of  Xi  in  Z.  Lastly,  for  any  vertices  in  S  which  are  part  of  a  reduction 
tree  that  contributes  to  Xi,  we  add  the  hyperedge  to  cut  C,  disconnecting  all  dependencies  of  Xi  from  its  dependants.  Thus,  C  is  a 
hyperedge  cut  of  FfTRSV  since  each  hyperedge  in  H  corresponds  to  a  unique  partition  of  edges  in  Gtrsv(’t^)  (including  edges  in  the 
reduction  trees),  and  we  have  disconnected  all  hyperedges  corresponding  to  the  set  of  edges  in  Gtrsv(?t^)  which  were  disconnected 
by  separator  S.  Therefore,  since  disconnecting  the  edges  adjacent  to  S  in  GTRsv(t^)  broke  all  paths  between  some  two  partitions  of 
vertices,  so  must  the  hyperedge  cut  G  in  FfTRSV- 

Further,  the  cut  G  which  we  have  thus  constructed  is  at  most  twice  the  size  of  S,  since  we  added  at  most  two  edges  to  G  for  each 
vertex  in  S.  By  Theorem  5.2,  any  f -balanced  cut  of  a  2-dimensional  lattice  hypergraph  is  of  size  n{n/y/q).  Therefore,  any  vertex 
separator  must  be  of  size  at  least  Xq(GTRsv(''T'))  =  Gl 

Theorem  6.2.  Any  parallelization  of  any  dependency  graph  Gtrsv(^)  where  two  processors  compute  \  /2q\  elements  ofZ  (for  some 
2  <  <7  <C  n)  must  incur  a  communication  cost  of 

f^TRsv  =  ^  {n/ sfq)  ■ 

Proof  Let  G  be  any  dependency  graph  Gtrsv(’t^)  for  Algorithm  1.  Every  vertex  in  G  that  has  an  outgoing  edge  to  a  vertex  computed 
by  a  different  processor  (different  color)  must  be  communicated.  Since  two  processors  compute  \ii? /q\  vertices  of  Z,  the  communicated 
set  can  be  bounded  below  by  the  size  of  a  ^-balanced  separator  of  Z  within  Gtrsv-  By  application  of  Lemma  6.1,  the  size  of  any  such 
separator  is  at  least  n{n/y/q).  □ 

Theorem  6.3.  Any  parallelization  of  any  dependency  graph  Gtrsv(^)  where  two  of  p  processors  compute  \jnf/2p\  elements  ofZ 
incurs  the  following  computation  (F),  bandwidth  (W ),  and  latency  (S)  costs,  for  some  &  €  [1,  n], 

Ttrsv  =  ft  {n  ■  b) ,  Wtrsv  =  ^(n) ,  Strsv  =  {n/b) , 


and  furthermore. 


Ttrsv  ■  ‘S'trsv  =  ■ 


Proof  Let  G  be  any  dependency  graph  Gtrsv  for  Algorithm  1.  We  note  that  the  computation  of  Xi  for  i  G  {1, . . . ,  n}  requires 
the  computation  of  Zjk  for  j,  k  G  i}  with  k  <  j.  Furthermore,  no  element  Zim  for  l,m  G  {i  +  1, . . .  ,n}  with  I  <  m  may 

be  computed  until  Xi  is  computed.  Consider  any  subpath  7?.  C  7^  of  the  dependency  path  V  =  {xi, . . .  ,x„}.  We  recall  that  the 
bubble  f5(G,TV)  =  {Vis,Ep)  around  TZ  is  the  set  of  all  computations  that  depend  on  an  element  of  TZ  or  influence  an  element  of  TZ. 
Evidently,  if  7?.  =  {x^, . . . ,  x^j,  the  bubble  includes  vertices  corresponding  to  a  subtriangle  of  Z,  namely,  Z^i  G  Vp  for  k,l  G  {i, . . . ,  j} 
with  I  <  k.  Therefore,  fi{G,TZ)  is  isomorphic  to  Gtrsv(I^I)^  which  implies  that  |yg|  =  0(|7^p)  and  by  Lemma  6.1,  we  have 
Xq{(3{G,TZ))  =  fl{\TZ\/-i/q)-  Since  the  bubbles  for  TRSV  are  2-dimensional  we  apply  Corollary  4.2  with  c?  =  2  to  obtain,  for  some 
b  G  [1,  n], 

Ftrsv  =  fl{n  ■  b) ,  Wtrsv  =  (n) ,  Strsv  =  ^  (n/b) . 


□ 


6.1.2  Attainability 

The  lower  bounds  presented  above  for  triangular  solve,  are  attained  by  the  communication-efficient  execution  blocking  schedule  sug¬ 
gested  in  Papadimitriou  and  Ullman  [17].  Algorithm  2  below  uses  this  blocking  schedule  with  blocking  factor  b  to  compute  the 
triangular  solve.  Our  algorithm  is  similar  to  the  wavefront  algorithm  given  by  Heath  et  al.  [11]. 

The  parallel  Algorithm  2  can  be  executed  using  p  =  n/b  processors.  Let  processor p;  for  I  G  {1, . . . ,  n/bj  initially  own  Ly,  pj  for 
i  G  n},  j  G  {(I  —  l)b  +  1, . . . ,  lb}.  Processor  pi  performs  parallel  loop  iteration  I  at  each  step  of  Algorithm  2.  Since  it  owns 
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Algorithm  2  Parallel  triangular  solve  (TRSV)  algorithm 

X  =  TRSV(L,y,n) 

1  X  =  y 

2  for  fc  =  1  to  njh 

3  //  Each  processor  pi  executes  a  unique  iteration  of  below  loop 

4  parallel  for  I  =  max(l,  2k  —  n/b)  to  k 

5  if ;  >  1 

6  Receive  length  b  vector  x[(2fc  —  I  —  1)6  +  1  :  {2k  —  l)b]  from  processor  pi_i 

1  for  i  =  {2k  —  I  —  1)6  +  1  to  {2k  —  l)b 

8  for  j  =  (?  —  1)6  +  1  to  min(i  —  1,  lb) 

9  Xi  —  {Xi  biij  *  Xj ) 

10  if  k  =  I 

11  Xi=Xi/Lu 

12  if  ^  <  n/6 

13  Send  length  6  vector  x[(2fc  —  I  —  1)6  +  1  :  {2k  —  l)b]  to  processor 

14  parallel  for  I  =  max(l,  2fc  +  1  —  n/b)  to  k 

15  if;>l 

16  Receive  length  6  vector  x[(2fc  —  /)6  +  1  :  {2k  —  I  +  1)6]  from  processor  p;_i 

17  for  i  =  {2k  —  l)b  +  1  to  {2k  —  I  +  1)6 

18  for  j  =  (?  —  1)6  +  1  to  ^6 

19  Xi  —  {Xi  ^ij  *  ^j) 

20  if  ^  <  n/6 

21  Send  length  6  vector  x[(2fc  —  /)6  +  1  :  {2k  —  I  +  1)6]  to  processor 


the  necessary  panel  of  L  and  vector  part  Xj,  no  communication  is  required  outside  the  vector  send/receive  calls  listed  in  the  code.  So  at 
each  iteration  of  the  outer  loop  at  least  one  processor  performs  0{b^)  work,  and  26  data  is  sent,  requiring  2  messages.  Therefore,  this 
algorithm  achieves  the  following  costs, 

^TRSv  =  0{nb),  ITtrsv  =  0{n),  S'trsv  =  0{n/b), 

which  attains  our  communication  lower  bounds  in  Theorems  6.2  and  6.3  for  any  6  €  {1,  n}.  Parallel  TRSV  algorithms  in  current  numer¬ 
ical  libraries  such  as  Elemental  [18]  and  ScaLAPACK  [6]  employ  algorithms  that  attain  our  lower  bound,  modulo  an  extra  0(log(p)) 
factor  on  the  latency  cost,  due  to  their  use  of  collectives  for  communication  rather  than  the  point-to-point  communication  in  our  wave- 
front  TRSV  algorithm. 

6.2  Gaussian  elimination 

In  this  section,  we  show  that  the  Gaussian  elimination  algorithm  has  3-dimensional  bubble-growth  and  dependency  graphs  which  satisfy 
the  path  expansion  properties  necessary  for  the  application  of  Corollary  4.2  with  d  =  3.  We  consider  factorization  of  a  symmetric 
matrix  via  Cholesky,  as  well  as  Gaussian  elimination  of  a  dense  nonsymmetric  matrix.  We  show  that  these  factorizations  of  n-by-n 
matrices  form  an  intermediate  3D  tensor  Z  such  that  Zijk  G  Z  for  i  >  j  >  k  G  {1, . . . ,  n}  and  Zijk  is  dependent  on  each  Zimn 
for  I  >  m  >  n  G  {1, ...  j  —  1} ■  We  assume  that  a  fast  matrix  multiplication  algorithm  is  not  used,  though  we  conjecture  that  our 
analysis  can  be  extended  to  account  for  potential  use  of  Strassen’s  matrix  multiplication  algorithm  and  likely  other  fast  algorithms  for 
multiplication. 

6.2.1  Cholesky  factorization 

The  Cholesky  factorization  of  a  symmetric  positive  definite  matrix  A  is 

A  =  L  L^, 

for  a  lower-triangular  matrix  L.  A  simple  sequential  algorithm  for  Cholesky  factorization  is  given  in  Algorithm  3.  We  introduced  an 
intermediate  tensor  Z,  whose  elements  must  be  computed  during  any  execution  of  the  Cholesky  algorithm  (although  Z  itself  need  not  be 
stored  explicitly  in  an  actual  implementation).  We  note  that  the  Eloyd-Warshall  [10,  25]  all-pairs  shortest-path  graph  algorithm  has  the 
same  dependency  structure  as  Cholesky  for  undirected  graphs  (and  Gaussian  Elimination  for  directed  graphs),  so  our  lower  bounds  may 
be  easily  extended  to  this  case.  However,  interestingly  our  lower  bounds  for  this  graph  algorithm  are  not  valid  for  the  all-pairs  shortest- 
paths  problem  in  general,  which  may  alternatively  be  solved  via  path-doubling  (a  technique  which  naively  incurs  an  extra  computational 
cost,  but  may  be  augmented  to  have  the  same  asymptotic  costs  as  matrix  multiplication,  as  shown  by  Tiskin  [23]). 
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Algorithm  3  Cholesky  factorization  algorithm 
L  =  CHOLESKY(A,n) 

1  for  j  =  1  to  n 

2  ~  \J ~  X]i=l  Ljk  •  Ljk 

3  for  z  =  j  +  1  to  n 

4  for  A:  =  1  to  j  —  1 

5  ^ijk  —  •  Ljk 

6  Lij  =  (Aij  —  J2k=l  ^ijk)/Ljj 


6.2.2  LU  factorization 

The  LU  factorization  of  a  square  matrix  A  is 

A  =  L  U, 

for  a  lower-triangular  matrix  L  and  a  unit-diagonal  upper  triangular  matrix  U  (we  make  U  rather  than  L  have  a  unit-diagonal  for 
notational  convenience).  A  simple  non-pivoted  algorithm  for  LU  factorization  is  given  in  Algorithm  4.  Within  the  computation  of 

Algorithm  4  LU  factorization  algorithm 


L,U  =  LU(A,n) 

1  for  j  =  1  to  rz 

2  for  i  =  1  to  j  —  1 
for  k  =  1  to  z  —  1 

^jik  —  kj^k  ’  Ukj 


3 

4 

5 

6 

7 

8 
9 

10 


ik  '  ^  kj 

Uij  =  Aij  —  Zjik 

~  ~  'l2k=l  '  ^kj 


for  i  =  j  -I- 1  to  zz 

for  k  =  1  to  j 

Zijk  —  AvjZ; 


Ukj 

Z-l 


Lij  —  (Aij  Zijk)lL 
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LU  factorization,  the  rz^/6  intermediate  vertices  in  Z  within  Algorithm  4  are  analogous  to  the  intermediate  vertices  of  the  Cholesky 
computation  of  the  previous  section.  The  vertices  designated  as  Z  in  the  LU  computation  are  ignored  in  our  further  analysis.  Ignoring 
vertices  does  not  invalidate  our  lower  bounds  because  they  can  only  necessitate  more  work  and  communication. 

6.2.3  Lower  bounds 

We  note  that  the  summations  on  lines  2  and  6  of  Algorithm  3  as  well  as  lines  6  and  10  of  Algorithm  4  can  be  computed  via  any 
summation  order  (and  will  be  computed  in  different  orders  in  different  parallel  algorithms).  This  implies  that  the  summed  vertices 
are  connected  in  any  dependency  graph  G'ge(?^),  but  the  connectivity  structure  may  be  different.  We  define  a  3-dimensional  lattice 
hypergraph  TTge  =  (Vge,  ^'Ge)  for  the  algorithm  which  allows  us  to  obtain  a  lower  bound  for  any  possible  summation  order,  as 


Uge  ={Zijk  ■■  i,j,k  j  >  k}, 

Ege  ={ei,j  with  z  >  j}  where 

Cij  =  {Zijk  :  fc  e  {1, . . . ,  j  -  1}}  U  {Zikj  :  fc  e  {j  -I-  1, . . .  ,z  -  1}}  U  {Zkij  :  k  €  {i  +  1, . .  ■  ,zz}} 

Figure  3(a)  displays  the  intermediate  vertices  of  Hge(^Q)-  We  enumerate  the  set  of  hyperedges  Eqe  via  elements  Cij  G  Eqe, 
i,j  G  {l,...,zz}  withz  >  j. 

Lemma  6.4.  Any  vertex  separator  S  within  dependency  graph  GoEin)  that  subdivides  the  intermediate  vertices  Z  into  two  sets  of  size 
at  least  [rz^/SgJ  (where  2  <  g  <C  zz)  must  have  size  at  least 

Xq(GGE(n))  =  n  (zz^/g^/^)  . 

Proof.  We  show  that  for  any  such  separator  S  in  Gqe,  it  is  possible  to  construct  a  hyperedge  cut  G  of  Hge  which  consists  of  at  most 
SIS'!  hyperedges.  The  separator  S  may  include  vertices  in  Z,  in  L,  or  in  a  reduction  tree  that  contributes  to  an  entry  in  L.  In  the  first 
case,  if  S  includes  an  entry  Zijk,  then  this  entry  is  disconnected  entirely  from  the  graph,  while  the  connectivity  of  other  vertices  in  Z  is 
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(a)  (b) 

Figure  3:  These  diagrams  show  (a)  the  vertices  in  Vqe  with  n  =  16  and  (b)  the  hyperplane  Xi2  and  hyperedge  ei2,6  on  J^ge- 


not  affected.  For  each  such  entry  Zijk  €  S  we  add  edges  Cij,  et^k  and  ej^k  to  C,  effectively  disconnecting  Zijk  within  the  hypergraph 
Hqe-  In  the  latter  two  cases,  if  S  includes  entry  or  an  entry  to  a  reduction  tree  that  contributes  to  Lij,  we  add  edge  Cij  to  C. 
Including  the  entry  Lij  or  a  intermediate  in  a  reduction  tree  which  contributes  to  it  disconnects  vertices  which  are  dependent  on  Lij 
from  the  dependencies  thereof.  These  dependencies  are  encoded  in  the  hypergraph  iTcE  by  the  edge  j,  the  removal  of  which  serves 
to  break  all  possible  paths  that  could  have  gone  through  or  the  reduction  tree  in  iTcE-  Now  we  ascertain  that  C  is  a  hyperedge  cut 
of  iFcE  since  each  hyperedge  in  H  corresponds  to  a  unique  partition  of  edges  in  GGE(ft)  (including  edges  in  the  reduction  trees),  and 
we  have  disconnected  all  hyperedges  corresponding  to  the  set  of  edges  in  GoEin)  which  were  disconnected  by  separator  S.  Therefore, 
since  disconnecting  the  edges  adjacent  to  S  broke  all  paths  between  some  two  partitions  of  vertices  in  GGE(ft),  the  hyperedge  cut  C 
must  disconnect  the  same  partitions  in  TTge- 

For  each  vertex  in  S,  we  have  added  at  most  3  edges  to  the  hyperedge  cut  G,  and  have  disconnected  the  same  or  larger  sets  of  vertices 
within  the  hypergraph  in  each  case.  By  Theorem  5.2,  a  ^-balanced  cut  of  the  vertices  Z  in  Hqe  is  of  size  Therefore,  any 

vertex  separator  on  a  GcEin)  must  be  of  size  at  least  Xq{GGE{n))  =  Ll{jn? / Gy  □ 

Theorem  6.5.  Any  parallelization  of  any  dependency  graph  GoEin),  where  two  processors  each  compute  [n^/SgJ  elements  ofL  (VgeA 
must  incur  a  communication  of 

H^ge  =  ^  . 

Proof  For  any  GcE{n),  every  vertex  that  has  an  outgoing  edge  to  a  vertex  computed  by  a  different  processor  (different  color)  must  be 
communicated.  Since  two  processors  each  compute  \rif  /iq\  elements  of  Z,  the  communicated  set  can  be  bounded  below  by  the  size  of 
a  i-balanced  separator  of  the  vertices  Z  in  Gge(’T')-  By  Lemma  6.4,  the  size  of  any  such  separator  is  Ltynf  / q^Gy  □ 

Theorem  6.6.  Any  parallelization  of  any  dependency  graph  Gge(?t-)  in  which  two  ofp  processors  compute  \  nf  /ip\  vertices  ofZ  incurs 
the  following  computation  (F),  bandwidth  (W },  and  latency  (S)  costs,  for  some  6  €  [1,  n], 

Fqe  =  Ll{n  ■h'^)  ,  Wge  =  Ll{n-  b) ,  Sqe  =  ^  (n/b) , 

and  furthermore, 

Fqe  ■  Sqe  =  II  {nf)  ,  Wge  ■  ‘S'ge  =  H  {n^)  ■ 

Proof  Let  G  be  a  dependency  graph  of  Gge(?^)-  We  note  that  the  computation  of  La  for  i  G  {1, . . .  ,n}  requires  the  computation 
of  Zimk  for  l,m,k  G  with  I  >  m  >  k.  Furthermore,  no  element  Zg^t  for  s,r,  f  G  {i  +  1, . . .  ,n}  with  s  >  r  >  t 

can  be  computed  until  La  is  computed.  Consider  any  subpath  7?.  C  T"  of  the  dependency  path  V  =  {Ln, . . .  ,L„„}.  Evidently,  if 
TZ  =  {La, . . . ,  Lj+ij+i},  the  bubble  f]{G,TZ)  =  (Vg,  Ejs)  includes  vertices  corresponding  to  a  subcube  of  Z,  namely  Zkim  S  for 
k,l,m  G  with  k  >  I  >  m.  Therefore,  j3{G,TZ)  is  isomorphic  to  Gge(|I^|),  which  implies  that  iVaj  =  0(|7^|^)  and  by 

Lemma  6.4,  we  have  Xq{l^{G,  TZ))  =  Q{\7Z\'^ /q'^Gy  Since  we  have  3-dimensional  bubbles  with  2-dimensional  cross-sections,  we  apply 
Corollary  4.2  with  d  =  3  to  obtain,  for  some  b  G  [1,  n], 

Fqe  =  LI  {n  ■  b'^)  ,  ILge  =  Ll{n-  b) ,  Sqe  =  H  {n/b)  ■ 


□ 
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Bubble 


Figure  4:  Depiction  of  the  bubble  (blue  parallelogram)  along  a  dependency  path  (red  dashed  path)  within  a  Krylov  basis  computation  on 
a  1 -dimensional  mesh  (2-point  stencil).  Edges  within  the  bubble  are  colored  according  to  the  hypergraph  edge  to  which  they  correspond. 

6.2.4  Attainability 

The  lower  bounds  presented  in  the  previous  section  are  attained  on  p  processors  for  b  «  nj ^Jp  by  ‘2D  algorithms’,  which  utilize 
a  blocked  matrix  layout  and  are  employed  by  most  standard  parallel  libraries  (including  Elemental  [18]  and  ScaLAPACK  [6]).  The 
BSP  [16]  algorithms  presented  by  Tiskin  [16]  for  LU  (without  pivoting  and  with  pairwise  pivoting  [21])  and  QR  factorization  with 
Givens  rotations,  match  the  lower  bounds  in  Theorem  6.5  and  Theorem  6.6  for  any  h  S  n! y/p].  Therefore,  Tiskin’s  algorithms 

can  lower  the  bandwidth  cost  with  respect  to  2D  algorithms  by  a  factor  of  up  to  at  the  cost  of  raising  latency  by  the  same  factor. 
Similarly,  2.5D  algorithms  [20]  for  LU  factorization  without  pivoting  and  with  tournament  pivoting  [8]  also  lower  bandwidth  cost  by 
a  factor  of  up  to  p^/®  by  sacrificing  latency  cost.  2.5D  algorithms  are  practical  and  can  improve  upon  the  performance  of  standard  2D 
algorithms  so  long  as  the  matrix  is  large  enough  to  amortize  synchronization  overheads.  Therefore,  the  latency-bandwidth  trade-off  is 
of  particular  importance  for  this  problem. 

We  note  that  the  3D  parallel  LU  algorithm  given  by  Irony  and  Toledo  [12],  a  major  motivation  for  some  of  the  communication- 
efficient  algorithms  in  the  last  paragraph,  is  not  optimal  in  our  model,  because  it  does  not  minimize  bandwidth  cost  along  the  critical 
path,  but  only  communication  volume.  This  suboptimality  is  justified  by  the  more  positive  performance  results  for  2.5D  algorithms 
collected  in  [20]  and  [19],  with  respect  to  the  performance  observed  by  the  LU  implementation  of  Irony  and  Toledo  [12]. 

6.3  Krylov  basis  computation 

We  consider  the  s-step  Krylov  subspace  basis  computation 

x«  =A.x('-i), 

for  Z  e  {1, . . . ,  s}  and  x^^^  given  as  input,  where  the  graph  of  the  symmetric  sparse  matrix  A  is  a  (2m  -f  Ij'^-point  stencil  (with  m  >  1), 
i.e.,  d-dimensional  n-by--  •  •  -by-n  mesh  T,  and  each  entry  in  A  represents  an  interaction  between  vertices  G  T,  such 

that  for  k  G  {1, . . . ,  d},  ik,jk  S  {1, . . .  ,n},  \jk  —  ik\  <  'm--  Thus,  matrix  A  and  vectors  x^^^,  I  G  {0, . . . ,  s},  have  dimension  n'^, 
and  A  has  nonzeros  per  row/column.  We  note  that  the  dependency  structure  of  this  computation  is  analogous  to  direct  force 

evaluation  in  particle  simulations  and  the  Lord-Lulkerson  shortest-path  algorithm,  which  may  be  expressed  as  sparse-matrix  times  vector 
multiplication  but  on  different  algebraic  semirings. 

Theorem  6.7.  Any  parallel  execution  of  an  s-step  Krylov  subspace  basis  computation  for  a  (2m  -f  l)‘^-point  stencil,  for  m  >  1,  on  a 
d-dimensional  mesh  with  d  s,  requires  the  following  computational,  bandwidth,  and  latency  costs  for  some  b  G  {1, . . .  s}, 

FKr  =  ^{m‘‘-b^-s),WKr  =  ^{m‘^-b'^~^-s),SKr  =  ^{s/b). 

and  furthermore, 

FKr  ■Si,  =  n  (m'"  •  ,  VkKr  •  =  fl  {m‘^  ■  s'')  . 

Proof  In  the  following  analysis,  we  will  discard  factors  of  d,  as  we  assume  cZ  <C  s  (d  is  a  constant),  which  is  reasonable  for  most 
problems  of  interest  {d  G  {2, 3}),  but  some  assumptions  in  this  analysis  may  need  to  be  revisited  if  more  precise  consideration  of  high¬ 
dimensional  meshes  is  desired.  We  let  Gxr  =  (]^Kr,  Fxr)  be  the  dependency  graph  of  the  s-step  Krylov  subspace  basis  computation 
defined  above.  We  index  the  vertices  Vkt  5  with  d-\-l  coordinates,  each  corresponding  to  the  computation  of  an  intermediate 

vector  element  for  I  G  {1, . . . ,  s}  and  k  =  (*^bis  assumes  the  lexicographical  ordering  of  {1, ... ,  n}'').  Lor  each  edge 

(t'zi....,id,'w;ji....,jJ  in  T  and  each  Z  S  {1, . . .  ,s},  there  is  an  edge  in  Eki-. 

Consider  the  following  path  V  and  any  subpath  TZ,  where  \R\  =  r  >  3, 

V  G>  TZ  =  {t^l,...,l,/t+l,  ■  ■  •  ,  'fl,...,l,/t+r}  ■ 
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The  dependency  bubble  f3{GKr,T^)  =  includes 


Vp  :ij  <  m  ■  min(id+i  -  h-  l,h  +  r  -  id+i),j  €  {1, . .  .,d}, 

h  +  1  <  id+i  <  h  +  r]. 


For  each  (ui, . . . ,  Ud+i)  €  {1, .  ■  ■  ,r  —  2}*^+^  we  define  the  block 

G  Vp  :ij  e  {\m/2]{uj  -  1)  +  1, . . . ,  \m/2]uj)},j  € 
id=i  =  Ud+i  +  h  +  l} . 

Thus,  each  block  should  contain  [7^/2]"^  vertices  on  the  same  level,  Ud+i-  We  note  that  because  the  breadth  of  the  blocks  is  [m/2]  and 
the  interaction  distance  (stencil  radius)  is  m,  every  vertex  in  depends  on  every  vertex  in  for  each 

j  e  {1, . . . ,  d},  as  well  as  on  vertices  Bu^^...,Uj,...,Ud+i-i- 

We  now  construct  a  graph  G'^j.  =  f^Kr)  corresponding  to  the  connectivity  of  the  blocks  within  the  given  bubble  /3(GKr,  B), 
enumerating  them  on  a  lattice  of  breadth  g  =  [(r  —  2)/(ci  +  1)J .  For  each  (mi,  . . . ,  Ud+i)  G  {1, . . . ,  we  have  Wui,...,ud+i  G 

corresponding  to  block  Bu^^...^ud,t  C  Vkr  with  t  =  ^j-  Each  vertex  is  connected  to  vertices  for 

j  G  {1, . . . ,  d  +  1},  by  edges  in  A  representative  bubble  is  shown  for  d  =  1  in  Figure  4,  where  it  is  evident  that  the  bubble  vertices 
can  be  enumerated  on  a  skewed  lattice  as  above. 

We  transform  into  a  hypergraph  id  =  (Vff,  Eh),  so  that  for  q  sa  f -balanced  separator  of  G]^j.  is  proportional  to  a  f -balanced 
hyperedge  cut  of  H.  We  define  Vh  =  G  :  ii  <  72  <  •  •  •  <  id+i},  and  the  hyperedges  Eh  correspond  to  unions  of 

vertices  adjacent  to  disjoint  subsets  of  edges  in  G]^^.  In  particular,  we  define  hyperedges  G  Eh  for  ii,. . .  ,id  G  {1,. . .  ,g} 

with  ii  <  ■  ■  ■  <  id,  to  contain  all  which  satisfy  ji  <  •  •  •  <  jd+i  and  {ii, . . . ,  id}  C  {ji, . . .  ,jd+i}-  We  can  form  these 

hyperedges  as  unions  of  edges  in  G'j^^., 

il-l  *2-1  9-1 

t^i-i,...,id  tG  ,...,id)  U  ,^,72  ■■  ■  id  5  ICii  ,fc+l,-i2 )  U  .  .  .  U 

k=l  k=ii  k=id 

where  each  pair  of  vertices  in  the  union  corresponds  to  a  unique  edge  in  Because  these  hypergraph  edges  correspond  to  disjoint 
subsets  of  any  vertex  separator  of  G'^^^  can  be  transformed  into  a  hyperedge  cut  G  of  id  of  the  same  size  or  less  formed  by  taking 
the  hypergraph  edges  in  id  to  which  the  vertices  in  the  separator  are  adjacent  to.  Further,  any  such  f -balanced  separator  of  G^^  cannot 
be  smaller  than  |C'|  because  each  vertex  is  adjacent  to  no  more  than  d+1  edges. 

We  note  that  the  hypergraph  id  is  a  lattice  hypergraph  of  dimension  d+1  and  breadth  g  =  [(|7^|  —  2)/{d  +  1)J .  By  Theorem  5.2, 
its  f-balanced  hyperedge  cut  has  size  |G|  >  tq{E[)  =  ^{g'^ / q'^Ed+t)y  Furthermore,  since  the  ^-balanced  separator  of  /3{G'^j.,  TZ)  is  at 
least  3^|G| 

X,(«G'k„R))  =  Si  =  !i(|K|-). 

where  the  last  bound  follows  since  we  have  d,q  ^  s. 

This  lower  bound  on  edge  cut  size  in  the  block  graph  G]^^  allows  us  to  obtain  a  lower  bound  on  the  size  of  any  ^ -balanced  separator 
of  Gxr,  that  is  larger  by  a  factor  of  n(m‘^).  Consider  any  ^-balanced  separator  of  Gxr  that  separates  the  vertices  into  three  disjoint 
subsets,  the  separator  Q,  and  the  parts  Vi  and  V2-  If  two  vertices  u,v  G  Vkr  are  in  two  different  partitions  (are  of  different  color),  u  G  Vi 
and  V  G  V2,  and  are  in  the  same  block,  all  vertices  in  the  d  adjacent  blocks  must  have  all  their  vertices  entirely  in  Q  (since  all  vertices 
in  adjacent  blocks  in  G]^^  are  adjacent  to  u  and  v  in  GKr)-  Therefore,  the  number  of  blocks  which  contain  vertices  of  different  color 
is  less  than  \Q\/m'^  and  therefore  small  with  respect  to  |V//j.|/g.  Therefore,  Q  should  yield  n(|  blocks  which  contain  vertices 

that  are  either  in  the  separator  or  in  Vi  and  similarly  for  V2-  Now,  since  two  blocks  Bi  C  (Vi  U  Q)  and  B2  C  (V2  U  Q)  which  contain 
non-separator  vertices  of  different  color  may  not  be  adjacent,  there  must  exist  a  separator  block  B3  C  Q  for  any  path  on  the  lattice 
between  Bi  and  i?2.  Therefore,  Q  also  induces  a  separator  of  G'^^^  of  size  no  larger  than  |(5|  •  [m/2]'^.  So,  we  obtain  the  following 
lower  bound  on  the  size  of  a  separator  of  Gkx^  X9(/3(GKr,  =  HCm”^  •  Xg(/?(GK^,  7^)))  = 

Now,  the  bubble  size  is  \Vp\  =  n(m‘^|7^|'^'''^)  and  the  total  length  of  our  main  dependency  path  is  \V\  =  s.  By  Definition  4.1,  Gxr 
is  a  (e,  ct) -path-expander  with  e{h)  =  and  a{h)  =  Therefore,  by  application  of  Theorem  4.1,  with  F  =  3,  we  obtain  the 

following  lower  bounds,  for  some  b  G  [3,  s]. 


□ 


6.3.1  Attainability 

A  parallel  communication-avoiding  algorithm  for  computing  a  Krylov  subspace  basis,  termed  ‘PAF,  is  presented  in  [9].  We  note  that 
although  PAl  performs  redundant  computation  to  lower  the  parallel  latency,  the  amount  of  redundant  work  and  reduction  in  messages 
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made  possible  by  redundant  work  are  not  asymptotically  significant,  and  thus  the  lower  bounds  of  Theorem  6.7  apply.  Computing  an 
s-step  Krylov  subspace  basis  with  a  (2m  +  l)‘^-point  stencil  with  block  size  b  G  {1, . . .  s}  can  be  accomplished  by  s/b  invocations  of 
PAl  with  basis  size  parameter  b.  The  costs  for  the  overall  computation  using  PAl  are  then 


FKr  =  l-  Oib^+^m^^) 

WKr  =  l-  0(6'^m‘^) 

0 

SKr=l-  0(1) 

0 


=  0{m'^  ■  b<^  •  s), 
=  0(m^  •  5''-!  •  s), 
=  0{s/b), 


under  the  assumption  =  0{bm).  This  algorithm  therefore  attains  the  lower  bounds  and  lower  bound  tradeoffs  of  Theorem  6.7. 


7  Conclusion 

Our  lower  bounds  showed  that  many  numerical  problems  which  have  lattice  dependency  structure,  require  execution  costs  which  are 
independent  of  the  number  of  processors  but  dependent  on  the  problem  size.  Architecturally,  our  results  provided  lower  bounds  on 
execution  time,  as  a  function  of  synchronization  latency  (a),  communication  throughput  (/3),  and  clock-speed  (7).  The  tradeoffs  we 
derive  describe  the  strong  scaling  limit  of  Gaussian  Elimination  and  Krylov  basis  computation  in  terms  of  these  three  quantities.  In  other 
words,  we  obtained  bounds  on  the  time  it  takes  for  any  number  of  processors  to  solve  a  system  of  linear  equations  via  certain  numerical 
algorithms  based  on  the  network  and  clock  speed  of  each  processor.  An  interesting  piece  of  future  work  will  be  to  consider  Krylov  basis 
computations,  which  are  analogous  to  the  Ford-Fulkerson  single-source  shortest-paths  graph  algorithm,  on  graphs  such  as  expanders  and 
binary  trees  rather  than  just  stencils  (grids),  which  our  graph-based  bubble-expansion  formulation  should  allow. 
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